Add fraud-detection example (IEEE-CIS) #140
Self-contained reproduction of Weco's fraud-detection case study.

- Downloads the Kaggle dataset and builds a leakage-safe 100K/25K time-based parquet split.
- Exposes train.py as the optimization target (feature engineering and the LightGBM config are both modifiable); evaluate.py prints auc_roc for Weco.
- instructions.md is the full EDA + techniques prompt from the case study: column semantics for each feature group (TransactionAmt, C/D/M/V), 10 well-known IEEE-CIS techniques (UID construction, target encoding with OOF, velocity features, frequency encoding), and a target-leakage guardrail pointing out the isFraud-in-df aggregation trap.
- README walks through Kaggle API setup, the prepare_data step, a baseline sanity check (~0.914 AUC), and the canonical weco run command (gemini-3.1-pro-preview, 50 steps, expected trajectory into 0.928-0.933).
- Also adds "things to try" (no-instructions variance blow-up, EDA-only ablation, scope restriction) and a silent-target-leakage watch-out pointing to the published case study.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er test

Two fresh-agent test rounds surfaced three issues; all fixed:

- kaggle CLI: the `kaggle` package has no `__main__`, so `python -m kaggle` crashes with ModuleNotFoundError. The correct entry point is `kaggle.cli`.
- The venv instruction used `python -m venv`, which fails on Debian/Ubuntu systems where only `python3` exists (no python-is-python3). Changed to `python3 -m venv`; after activation, `python` resolves correctly.
- pip install fails on modern PEP 668 systems without a venv. The README now leads with the venv setup before the install step, with a note on why.

Also: prepare_data.py now catches Kaggle's CalledProcessError and prints the two most common root causes (rules not accepted / kaggle.json perms) with the exact URL to accept the competition rules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
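A minimal sketch of the download-and-diagnose pattern this commit describes; the exact messages and the rules-URL format are assumptions, and the shipped prepare_data.py may differ:

```python
# prepare_data.py (sketch): download via the kaggle CLI's module entry point.
# `python -m kaggle` fails because the package ships no __main__.py;
# `kaggle.cli` is the module that actually exposes main().
import subprocess
import sys

COMPETITION = "ieee-fraud-detection"

try:
    subprocess.run(
        [sys.executable, "-m", "kaggle.cli",
         "competitions", "download", "-c", COMPETITION],
        check=True,
    )
except subprocess.CalledProcessError:
    # The two most common root causes on a fresh machine:
    print("Kaggle download failed. Check that:")
    print(f"  1. you accepted the competition rules: "
          f"https://www.kaggle.com/c/{COMPETITION}/rules")
    print("  2. ~/.kaggle/kaggle.json exists and is chmod 600")
    raise
```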
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 92cb31d6a4
```python
y_val = val_df["isFraud"].values.astype(np.int32)

n_train = len(train_df)
df = pd.concat([train_df, val_df], axis=0, ignore_index=True)
```
Fit feature aggregations on training data only
build_features concatenates train_df and val_df before creating grouped amount statistics and frequency encodings, so validation rows (future data in this time-based split) directly shape the engineered features used for evaluation. That leaks validation distribution into the pipeline and can systematically inflate the reported AUC that Weco optimizes against. Compute these encodings/aggregations from train_df only, then map them onto val_df with defaults for unseen keys.
Codex flagged that the baseline concatenates train + val before computing groupby aggregations and frequency encodings, letting the val-period distribution shape train features and letting each val row influence its own encoded values. Even with isFraud dropped first, this is time-leakage that inflates val AUC relative to what would be seen at serving time.

Fix: compute all encoders (card1/addr1 amount stats, frequency encoding) on train_df only; .join/.map onto both splits; fill unseen val keys with train-global defaults. Refactored per-row features (time, amount) into a small helper so both splits share that code path without concat.

Baseline AUC drops from the previously-reported 0.914 to 0.910: the right number, not artificially inflated. The expected Weco trajectory (0.928-0.933 at 200 steps with full instructions) is unchanged in shape; the case study's absolute numbers used the leaky baseline, so they shift slightly here.

Also expanded instructions.md and README to distinguish target leakage (isFraud in the dataframe during aggregation) from time leakage (val distribution in the encoder fit), with the fit-on-train / apply-to-both pattern spelled out for future encoders Weco proposes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
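A minimal sketch of the fit-on-train / apply-to-both pattern; card1 and TransactionAmt are real IEEE-CIS columns, but the helper names and defaults are illustrative, not the shipped code:

```python
import pandas as pd

def fit_encoders(train_df: pd.DataFrame) -> dict:
    """Learn all aggregations from the training split only."""
    return {
        # per-card1 mean transaction amount, computed on train rows only
        "card1_amt_mean": train_df.groupby("card1")["TransactionAmt"].mean(),
        # frequency encoding: how often each card1 value appears in train
        "card1_freq": train_df["card1"].value_counts(),
        # train-global default for keys never seen during fit
        "amt_global_mean": train_df["TransactionAmt"].mean(),
    }

def apply_encoders(df: pd.DataFrame, enc: dict) -> pd.DataFrame:
    """Map train-fitted encoders onto either split; unseen keys get defaults."""
    out = df.copy()
    out["card1_amt_mean"] = (out["card1"].map(enc["card1_amt_mean"])
                             .fillna(enc["amt_global_mean"]))
    out["card1_freq"] = out["card1"].map(enc["card1_freq"]).fillna(0)
    return out

enc = fit_encoders(train_df)             # never sees val rows
train_feats = apply_encoders(train_df, enc)
val_feats = apply_encoders(val_df, enc)  # future data is transformed, never fitted
```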
The previous prepare_data.py used pandas df.sample(random_state=42), which produced parquets with shape, fraud rate, and DT range matching the original case study but DIFFERENT row content; baseline AUC came out at 0.9023 instead of the case study's 0.9102. Recovered the original ad-hoc prep recipe from a Claude Code session transcript and rewrote to match. Two recipe details that turned out to matter:

1. Stratified train subsample preserving the fraud rate, using a single global np.random.seed(42) followed by sequential np.random.choice calls (NOT pandas df.sample). The val subsample inherits the advanced RNG state.
2. Label-encode using categories from concat(train, val), and include "string" alongside "object" in select_dtypes: pandas 3 uses StringDtype for string columns and skips them when only "object" is included, silently leaving them as raw strings (which would then crash LightGBM or be dropped before fit).

Verified locally: re-running this prepare_data.py from a fresh Kaggle download produces parquets with SHA-256s

train: a2d7a6740559975b8e6d89bd605f1e29791dd7d3fee8abc6449552bbc18d29ae
val:   8b426c8bf7fa845bc234dbce304b1107fd295143fac2398bab97b78805f50753

matching the case-study originals exactly. Baseline AUC = 0.910171.

README updated to reflect the now-deterministic 0.9102 baseline (the previous "0.910 because we removed the leak" gloss was misleading; the parquets themselves were different from the case study). Reframed the 0.914 reference as the case study's leaky-baseline AUC for clarity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
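A sketch of the shape of that recipe; train_full/val_full, the exact call order, and the size arithmetic are assumptions, and the byte-exact SHA-256s depend on details only the shipped prepare_data.py pins down:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # one global seed; the call order below is part of the recipe

# 1. Stratified 100K train subsample preserving the fraud rate, via sequential
#    np.random.choice calls (NOT df.sample, which consumes the RNG differently
#    and yields different rows even with the same seed).
fraud_idx = train_full.index[train_full["isFraud"] == 1].to_numpy()
clean_idx = train_full.index[train_full["isFraud"] == 0].to_numpy()
n_fraud = round(100_000 * len(fraud_idx) / len(train_full))
keep = np.concatenate([
    np.random.choice(fraud_idx, size=n_fraud, replace=False),
    np.random.choice(clean_idx, size=100_000 - n_fraud, replace=False),
])
train_df = train_full.loc[np.sort(keep)].copy()

# The val subsample inherits the now-advanced RNG state.
val_keep = np.random.choice(val_full.index.to_numpy(), size=25_000, replace=False)
val_df = val_full.loc[np.sort(val_keep)].copy()

# 2. Label-encode with categories taken from concat(train, val). Include
#    "string" alongside "object": StringDtype columns are silently skipped
#    when select_dtypes is given "object" alone.
for col in train_df.select_dtypes(include=["object", "string"]).columns:
    cats = pd.concat([train_df[col], val_df[col]]).astype("category").cat.categories
    train_df[col] = pd.Categorical(train_df[col], categories=cats).codes
    val_df[col] = pd.Categorical(val_df[col], categories=cats).codes
```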
Recent CLI versions ship important fixes — most relevant here, 0.3.31 added queue-mode submit recovery (`_recover_queue_suggest`) and a native `AutoResumePolicy` that together make the transient `Failed to submit result` race invisible to the user. Anyone with an older weco in their venv (e.g. operators reusing weco-gpu's pinned 0.3.25) was hitting this race and silently terminating runs short of their step budget. Switching the install command to `pip install --upgrade -r requirements.txt` ensures users picking up this example always get the latest fixes, regardless of what's pre-installed in their venv. Comment in the README explains why we never pin weco-cli. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…to -loose
The previous fraud-detection example exposed `build_features(train_df, val_df)`
in a single file. The agent could (and frequently did) `pd.concat([train, val])`
and silently introduce time-leakage in encoders. We measured the inflation at
0.001-0.005 AUC depending on parquet contents, and found that prompt-level
"fit on train only" warnings only achieved ~67% compliance across seeds.
The new fraud-detection/ example uses a fit/transform interface:
features.py: class FeatureBuilder with fit(X_train, y_train) + transform(X)
model.py: train_and_evaluate(X_train, y_train, X_val, y_val) -> float
evaluate.py: frozen orchestrator that strips isFraud, calls fb.fit then
fb.transform twice, and runs the model.
This closes three leakage paths at the interface:
- isFraud is dropped before X reaches features.py (target leakage out).
- val data is never visible to fit() (time leakage out).
- transform() has no y argument (val labels can't influence val features).
Weco optimizes:
- features.py and model.py separately for scope=features / scope=model
- both together (`--sources features.py model.py`) for scope=full
The file boundary IS the scope boundary; no leaky helper module is needed. (The interface is sketched after this commit message.)
Existing single-file example renamed to fraud-detection-loose/ and kept as
a comparison artifact. README in fraud-detection/ links to it.
Baseline AUC: 0.909132 (deterministic; ~0.001 below the loose version's
0.910171 — that's the leakage inflation in the loose baseline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
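A minimal sketch of that fit/transform interface, with a frequency-encoding stand-in for real feature logic; the shipped features.py, model.py, and evaluate.py will differ in detail:

```python
# features.py (sketch)
import pandas as pd

class FeatureBuilder:
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> "FeatureBuilder":
        # All statistics are learned here, from training rows only;
        # the frozen evaluate.py never passes val data into fit().
        self.card1_freq_ = X_train["card1"].value_counts()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # No y argument, so labels can never shape the transformed features.
        out = X.copy()
        out["card1_freq"] = out["card1"].map(self.card1_freq_).fillna(0)
        return out

# evaluate.py (sketch): frozen orchestrator. isFraud is stripped before
# anything reaches features.py; fit() sees train only; transform() runs twice.
fb = FeatureBuilder().fit(X_train, y_train)
auc = train_and_evaluate(fb.transform(X_train), y_train,
                         fb.transform(X_val), y_val)  # model.py's entry point
print(f"auc_roc: {auc:.6f}")
```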
The features.py docstring says y_train is a pd.Series (so users can call
.values, .map, .to_dict on it for OOF target encoding). Earlier evaluate.py
passed the result of .values.astype("int32") which is a numpy ndarray,
breaking any proposal that did `y_train.values` or `y_train.map(...)`.
Sanity-checked on a 3-seed Weco run: with the Series fix, proposals proceed
past step 1 instead of crashing on AttributeError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
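A before/after sketch of the one-line fix; the variable names and surrounding context in evaluate.py are assumed:

```python
# evaluate.py: keep y_train as a pandas Series, matching the features.py
# docstring, so proposals can call .values / .map / .to_dict on it.
y_train = train_df["isFraud"].astype("int32")

# The earlier version passed an ndarray instead, so any proposal doing
# y_train.map(...) or y_train.values crashed with AttributeError:
#   y_train = train_df["isFraud"].values.astype("int32")
```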
Summary
What's in the example
Verification
Two rounds of fresh-agent testing caught and fixed: the venv prereq on modern Python installs; `python3` vs `python` on Ubuntu; the `kaggle` package has no `__main__`, so the entry point is `kaggle.cli`. The final sanity check blocked on `403 Forbidden` from the Kaggle API (rules acceptance is a per-user prereq, called out in the README).
Test plan
🤖 Generated with Claude Code